{
 "cells": [
  {
   "cell_type": "raw",
   "id": "8371e5d6-eb65-4c97-aac2-05037356c2c1",
   "metadata": {},
   "source": [
    "---\n",
    "title: Handle Files\n",
    "sidebar_position: 3\n",
    "---"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "0d5eea7c-bc69-4da2-b91d-d7c71f7085d0",
   "metadata": {},
   "source": [
    "Besides raw text data, you may wish to extract information from other file types such as PowerPoint presentations or PDFs.\n",
    "\n",
    "The general strategy is to use a LangChain [document loader](/docs/modules/data_connection/document_loaders/) or other method to parse files into a text format that can be fed into LLMs.\n",
    "\n",
    "LangChain features a large number of [document loader integrations](/docs/integrations/document_loaders).\n",
    "\n",
    "Let's go over an example of loading and extracting data from a PDF. First, we install required dependencies:\n",
    "\n",
    "```{=mdx}\n",
    "import IntegrationInstallTooltip from \"@mdx_components/integration_install_tooltip.mdx\";\n",
    "import Npm2Yarn from \"@theme/Npm2Yarn\";\n",
    "\n",
    "<IntegrationInstallTooltip></IntegrationInstallTooltip>\n",
    "\n",
    "<Npm2Yarn>\n",
    "  @langchain/openai zod\n",
    "</Npm2Yarn>\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "430806a7-7d8a-4111-9f5d-46840dab0dc0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Module: null prototype] { default: \u001b[36m[AsyncFunction: PDF]\u001b[39m }"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import { PDFLoader } from \"langchain/document_loaders/fs/pdf\";\n",
    "// Only required in a Deno notebook environment to load the peer dep.\n",
    "import \"pdf-parse\";\n",
    "\n",
    "const loader = new PDFLoader(\"./test/data/bitcoin.pdf\");\n",
    "\n",
    "const docs = await loader.load();"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "50593908",
   "metadata": {},
   "source": [
    "Now that we've loaded a PDF document, let's try extracting mentioned people. We can define a schema like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "fb618df7-d7be-4f34-8939-6f7b10dfc2b6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import { z } from \"zod\";\n",
    "\n",
    "const personSchema = z.object({\n",
    "  name: z.optional(z.string()).describe(\"The name of the person\"),\n",
    "  hair_color: z.optional(z.string()).describe(\"The color of the person's hair, if known\"),\n",
    "  height_in_meters: z.optional(z.string()).describe(\"Height measured in meters\"),\n",
    "  email: z.optional(z.string()).describe(\"The person's email, if present\"),\n",
    "}).describe(\"Information about a person.\");\n",
    "\n",
    "const peopleSchema = z.object({\n",
    "  people: z.array(personSchema),\n",
    "});"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8824d553",
   "metadata": {},
   "source": [
    "And then initialize our extraction chain like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "046ae5ce",
   "metadata": {},
   "outputs": [],
   "source": [
    "import { ChatPromptTemplate } from \"@langchain/core/prompts\";\n",
    "import { ChatOpenAI } from \"@langchain/openai\";\n",
    "\n",
    "const SYSTEM_PROMPT_TEMPLATE = `You are an expert extraction algorithm.\n",
    "Only extract relevant information from the text.\n",
    "If you do not know the value of an attribute asked to extract, you may omit the attribute's value.`;\n",
    "\n",
    "const prompt = ChatPromptTemplate.fromMessages([\n",
    "  [\"system\", SYSTEM_PROMPT_TEMPLATE],\n",
    "  [\"human\", \"{text}\"]\n",
    "]);\n",
    "\n",
    "const llm = new ChatOpenAI({\n",
    "  modelName: \"gpt-4-0125-preview\",\n",
    "  temperature: 0,\n",
    "})\n",
    "\n",
    "const extractionRunnable = prompt.pipe(llm.withStructuredOutput(peopleSchema, { name: \"people\", }));"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62a92830",
   "metadata": {},
   "source": [
    "Now, let's try invoking it!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "fb8876a5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{ people: [ { name: \u001b[32m\"Satoshi Nakamoto\"\u001b[39m, email: \u001b[32m\"satoshin@gmx.com\"\u001b[39m } ] }"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "await extractionRunnable.invoke({ text: docs[0].pageContent });"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "88b52919",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Deno",
   "language": "typescript",
   "name": "deno"
  },
  "language_info": {
   "file_extension": ".ts",
   "mimetype": "text/x.typescript",
   "name": "typescript",
   "nb_converter": "script",
   "pygments_lexer": "typescript",
   "version": "5.3.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
